AITopics

Neural Information Processing SystemsFeb-16-2026, 01:34:04 GMT

UDA

Cleaning missing values: The human-generated questions may be unanswerable. Thus, we remove the Q&A items that lack available answers. Additionally, documents lacking any valid Q&A pairs are also removed.

artificial intelligence, dataset, natural language, (17 more...)

Country:

North America > United States (0.30)
Asia > China > Shanghai > Shanghai (0.06)
Asia > Singapore (0.05)

Industry: Law (0.96)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing SystemsFeb-16-2026, 01:34:02 GMT

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured.

large language model, machine learning, natural language, (18 more...)

Country:

Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Asia > Indonesia > Bali (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media (0.46)
Banking & Finance (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

arXiv.org Artificial IntelligenceNov-10-2025

Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding

Al-Homoud, Haneen, Ibrahim, Asma, Al-Jubran, Murtadha, Al-Otaibi, Fahad, Al-Harbi, Yazeed, Toibazar, Daulet, Wang, Kesen, Moreno, Pedro J.

Abstract--Cross-Lingual SynthDocs is a large-scale synthetic corpus designed to address the scarcity of Arabic resources for Optical Character Recognition (OCR) and Document Understanding (DU). The dataset comprises over 2.5 million of samples, including 1.5 million textual data, 270K fully annotated tables, and hundred thousands of real data based charts. Our pipeline leverages authentic scanned backgrounds, bilingual layouts, and diacritic aware fonts to capture the typographic and structural complexity of Arabic documents. In addition to text, the corpus includes variety of rendered styles for charts and tables. Finetuning Qwen-2.5-VL on SynthDocs yields consistent improvements in Word Error Rate (WER) and Character Error Rate (CER) in terms of OCR across multiple public Arabic benchmarks, Tree-Edit Distance Similarity (TEDS) and Chart Extraction Score (CharT eX) improved as well in other modalities. SynthDocs provides a scalable, visually realistic resource for advancing research in multilingual document analysis.

large language model, machine learning, natural language, (18 more...)

2511.04699

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (0.87)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)

Neural Information Processing SystemsOct-10-2025, 06:57:44 GMT

UDA

dataset, source dataset, uda dataset, (15 more...)

Country:

North America > United States (0.30)
Asia > China > Shanghai > Shanghai (0.06)
Asia > Singapore (0.05)

Industry: Law (0.96)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Neural Information Processing SystemsOct-10-2025, 06:57:41 GMT

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

arxiv preprint arxiv, computational linguistic, dataset, (13 more...)

Country:

Asia > Singapore (0.04)
North America > United States > Washington > King County > Seattle (0.04)
Asia > Indonesia > Bali (0.04)
(5 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Media (0.46)
Banking & Finance (0.46)
Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Kang, Lei, Fu, Xuanshuo, Terrades, Oriol Ramos, Vazquez-Corral, Javier, Valveny, Ernest, Karatzas, Dimosthenis

LLM-Driven Medical Document Analysis: Enhancing Trustworthy Pathology and Differential Diagnosis

arXiv.org Artificial IntelligenceJun-25-2025

Medical document analysis plays a crucial role in extracting essential clinical insights from unstructured healthcare records, supporting critical tasks such as differential diagnosis. Determining the most probable condition among overlapping symptoms requires precise evaluation and deep medical expertise. While recent advancements in large language models (LLMs) have significantly enhanced performance in medical document analysis, privacy concerns related to sensitive patient data limit the use of online LLMs services in clinical settings. To address these challenges, we propose a trustworthy medical document analysis platform that fine-tunes a LLaMA-v3 using low-rank adaptation, specifically optimized for differential diagnosis tasks. Our approach utilizes DDXPlus, the largest benchmark dataset for differential diagnosis, and demonstrates superior performance in pathology prediction and variable-length differential diagnosis compared to existing methods. The developed web-based platform allows users to submit their own unstructured medical documents and receive accurate, explainable diagnostic results. By incorporating advanced explainability techniques, the system ensures transparent and reliable predictions, fostering user trust and confidence. Extensive evaluations confirm that the proposed method surpasses current state-of-the-art models in predictive accuracy while offering practical utility in clinical settings. This work addresses the urgent need for reliable, explainable, and privacy-preserving artificial intelligence solutions, representing a significant advancement in intelligent medical document analysis for real-world healthcare applications.

differential diagnosis, large language model, machine learning, (17 more...)

2506.19702

Country: Europe > Spain (0.28)

Genre: Research Report > Promising Solution (0.48)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine > Diagnostic Medicine (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.69)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Diagnosis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Thomas, Eliott, Coustaty, Mickael, Joseph, Aurelie, Deloin, Gaspar, Carel, Elodie, D'Andecy, Vincent Poulain, Ogier, Jean-Marc

QUEST: Quality-aware Semi-supervised Table Extraction for Business Documents

arXiv.org Artificial IntelligenceJun-24-2025

Automating table extraction (TE) from business documents is critical for industrial workflows but remains challenging due to sparse annotations and error-prone multi-stage pipelines. While semi-supervised learning (SSL) can leverage unlabeled data, existing methods rely on confidence scores that poorly reflect extraction quality. We propose QUEST, a Quality-aware Semi-supervised Table extraction framework designed for business documents. QUEST introduces a novel quality assessment model that evaluates structural and contextual features of extracted tables, trained to predict F1 scores instead of relying on confidence metrics. This quality-aware approach guides pseudo-label selection during iterative SSL training, while diversity measures (DPP, Vendi score, IntDiv) mitigate confirmation bias. Experiments on a proprietary business dataset (1000 annotated + 10000 unannotated documents) show QUEST improves F1 from 64% to 74% and reduces empty predictions by 45% (from 12% to 6.5%). On the DocILE benchmark (600 annotated + 20000 unannotated documents), QUEST achieves a 50% F1 score (up from 42%) and reduces empty predictions by 19% (from 27% to 22%). The framework's interpretable quality assessments and robustness to annotation scarcity make it particularly suited for business documents, where structural consistency and data completeness are paramount.

information retrieval, machine learning, natural language, (20 more...)

2506.14568

Genre:

Workflow (0.66)
Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.89)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Unsupervised or Indirectly Supervised Learning (0.56)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

arXiv.org Artificial IntelligenceJun-23-2025

Enhancing Document-Level Question Answering via Multi-Hop Retrieval-Augmented Generation with LLaMA 3

Huang, Xinyue, Lin, Ziqi, Sun, Fang, Zhang, Wenchao, Tong, Kejian, Liu, Yunbo

--This paper presents a novel Retrieval-Augmented Generation (RAG) framework tailored for complex question answering tasks, addressing challenges in multi-hop reasoning and contextual understanding across lengthy documents. Built upon LLaMA 3, the framework integrates a dense retrieval module with advanced context fusion and multi-hop reasoning mechanisms, enabling more accurate and coherent response generation. A joint optimization strategy combining retrieval likelihood and generation cross-entropy improves the model's robustness and adaptability. Experimental results show that the proposed system outperforms existing retrieval-augmented and generative baselines, confirming its effectiveness in delivering precise, contextually grounded answers. Understanding complex question answering (QA) tasks requires deep comprehension of documents containing numbers, legal texts, and intricate language.

large language model, machine learning, natural language, (20 more...)

2506.16037

Country: North America > United States > California > Los Angeles County > Los Angeles (0.14)

Genre: Research Report > New Finding (0.34)

Industry: Law (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Rodríguez, Adrià Molina, Terrades, Oriol Ramos, Lladós, Josep

The OCR Quest for Generalization: Learning to recognize low-resource alphabets with model editing

arXiv.org Artificial IntelligenceJun-19-2025

Achieving robustness in recognition systems across diverse domains is crucial for their practical utility. While ample data availability is usually assumed, low-resource languages, such as ancient manuscripts and non-western languages, tend to be kept out of the equations of massive pretraining and foundational techniques due to an under representation. In this work, we aim for building models which can generalize to new distributions of data, such as alphabets, faster than centralized fine-tune strategies. For doing so, we take advantage of the recent advancements in model editing to enhance the incorporation of unseen scripts (low-resource learning). In contrast to state-of-the-art meta-learning, we showcase the effectiveness of domain merging in sparse distributions of data, with agnosticity of its relation to the overall distribution or any other prototyping necessity. Even when using the same exact training data, our experiments showcase significant performance boosts in \textbf{transfer learning} to new alphabets and \textbf{out-of-domain evaluation} in challenging domain shifts, including historical ciphered texts and non-Latin scripts. This research contributes a novel approach into building models that can easily adopt under-represented alphabets and, therefore, enable document recognition to a wider set of contexts and cultures.

artificial intelligence, machine learning, recognition, (17 more...)

2506.06761

Country: Europe > Switzerland (0.14)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Inductive Learning (0.64)